US Traffic Accidents Severity Analysis & Prediction

Group 7

Table of Contents

  1. Data Overview
  2. Data Preprocessing
  3. Data Analysis
  4. Feature Engineering
  5. Modelling
  6. Conclusion
  7. Model Explainability

1. Data Overview

1.1. Load Data

The dataset has 1,516,064 entries with 46 features plus 1 target variable. Automatic type recognition: 13 columns are bool, 13 are float64, 1 is int64, and 20 are object. Memory usage: 412.1+ MB.
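A minimal sketch of the loading and overview step. The real notebook reads the full Kaggle CSV (the file path is an assumption); a tiny synthetic frame stands in here so the snippet is self-contained:

```python
import pandas as pd

# In the notebook the full dataset is loaded with something like:
#   df = pd.read_csv("US_Accidents.csv")   # path/filename is an assumption
# A small synthetic frame with the same kinds of dtypes stands in here.
df = pd.DataFrame({
    "Severity": [2, 3, 2, 4],                     # int64 target
    "Distance(mi)": [0.01, 1.2, 0.5, 3.4],        # float64
    "Amenity": [False, True, False, False],       # bool
    "City": ["Dayton", "Reno", "Dayton", "Boise"] # object
})

# Reproduce the overview: shape, dtype counts, and memory usage.
print(df.shape)
print(df.dtypes.value_counts())
print(f"{df.memory_usage(deep=True).sum() / 1024:.1f} KB")
```

On the full dataset the same three calls yield the figures quoted above.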

1.4. Data Categorizing

2. Data Preprocessing

2.1. Drop irrelevant columns

2.2. Drop columns with >40% missing values
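This filtering step can be sketched as follows; the toy columns and the 40% threshold applied here are illustrative (the threshold matches the heading):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Severity": [2, 3, 2, 4, 2],
    "Wind_Chill(F)": [np.nan, np.nan, np.nan, 30.0, np.nan],  # 80% missing
    "Humidity(%)": [65.0, np.nan, 70.0, 80.0, 55.0],          # 20% missing
})

# Share of missing values per column; drop anything above 40%.
missing_ratio = df.isna().mean()
df = df.drop(columns=missing_ratio[missing_ratio > 0.40].index)
print(list(df.columns))
```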

2.3. Data Type Correction

2.4. Data Imputation

Choose the imputation technique that best represents the central tendency of the data.

2.4.1. Drop all NaN/NA/null

2.4.2. Median imputation

2.4.3. Mean imputation
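The difference between the three options above can be seen on a toy series with one outlier (the values are illustrative): median imputation is robust to the outlier, while mean imputation is pulled toward it.

```python
import pandas as pd
import numpy as np

s = pd.Series([60.0, 65.0, np.nan, 70.0, 300.0])  # one extreme value

# Option 1: drop all NaN rows entirely.
dropped = s.dropna()

# Option 2: median imputation (robust to the outlier).
median_filled = s.fillna(s.median())

# Option 3: mean imputation (pulled toward the outlier).
mean_filled = s.fillna(s.mean())

print(len(dropped), median_filled[2], mean_filled[2])  # 4 67.5 123.75
```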

After Data Imputation:

3. Data Analysis

3.1. Basic Analysis

3.1.1. What is the distribution of accident severity?

As we can see from the graph, level 2 is the most frequent severity, accounting for 76.1% of the total. This means our target variable (label) is quite imbalanced.
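The class distribution behind the graph can be computed with `value_counts(normalize=True)`; the toy label counts here are chosen only to mirror the 76.1% figure:

```python
import pandas as pd

# Toy labels with roughly the reported class shares (illustrative).
severity = pd.Series([2] * 761 + [3] * 150 + [4] * 60 + [1] * 29)

# Relative frequency of each severity level.
dist = severity.value_counts(normalize=True)
print(dist)  # level 2 dominates at 0.761
```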

3.1.2. What is the relationship between distance and severity?

Generally speaking, the longer the affected road distance, the higher the severity level.

3.2. Location Analysis

3.2.1. What are the top 10 states with the most accidents?

3.2.3. What are the top 10 cities with the most accidents?

3.2.6. What is the accident distribution by street side?

Most accidents happened on the right side of the road, which is quite an interesting finding.

3.3. Time Analysis

3.3.4. Hour

It seems that 7:00/8:00 (commute to work) and 16:00/17:00 (commute home) are the times when most accidents happen during the day.

The drop in accident counts at weekends is mainly due to the disappearance of the 7:00/8:00 and 16:00/17:00 commute peaks; instead, weekend accidents peak between 12:00 and 16:00.
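The hourly breakdown above comes down to extracting the hour and weekday from the accident timestamp; the timestamps here are made up for illustration (in the dataset this would be the `Start_Time` column):

```python
import pandas as pd

# Illustrative timestamps; in the dataset this is the Start_Time column.
times = pd.to_datetime(pd.Series([
    "2020-03-02 07:15:00", "2020-03-02 08:05:00",  # Monday commute
    "2020-03-02 16:40:00",                          # Monday evening
    "2020-03-07 13:20:00",                          # Saturday afternoon
]))

# dayofweek: Monday=0 ... Sunday=6, so >= 5 marks the weekend.
is_weekend = times.dt.dayofweek >= 5

# Weekday accidents counted per hour of the day.
by_hour = times[~is_weekend].dt.hour.value_counts().sort_index()
print(by_hour)
```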

3.4. Environment Analysis

Freezing rain with wind, light blowing snow, and patches of fog with wind are the top 3 most dangerous weather conditions.

Temperature(F): lower temperature -> higher severity
Humidity(%): higher humidity -> higher severity
Pressure(in): lower pressure -> higher severity
Visibility(mi): lower visibility -> higher severity
Wind_Speed(mph): higher wind speed -> higher severity
In summary, all of these results point to cold, freezing weather conditions, e.g. freezing rain with wind, snow, and so on.
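Trends like the ones listed above can be checked by averaging each numeric weather feature per severity level; the values in this frame are illustrative, not taken from the dataset:

```python
import pandas as pd

# Illustrative rows shaped like the dataset's weather columns.
df = pd.DataFrame({
    "Severity": [2, 2, 3, 4, 4],
    "Temperature(F)": [60.0, 55.0, 40.0, 28.0, 25.0],
    "Wind_Speed(mph)": [5.0, 7.0, 10.0, 15.0, 18.0],
})

# Mean of each weather feature per severity level.
means = df.groupby("Severity")[["Temperature(F)", "Wind_Speed(mph)"]].mean()
print(means)
```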

3.4.4. Accident distribution by Sunrise/Sunset

Accidents mostly happened during the daytime.

4. Feature Engineering

4.1. Feature choosing

4.2. One Hot Encoding

4.3. Label Encoding
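The two encoding steps above can be sketched together; the column names `Side` and `Sunrise_Sunset` are taken from the dataset, but the rows are made up:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "Side": ["R", "L", "R"],
    "Sunrise_Sunset": ["Day", "Night", "Day"],
})

# One-hot encoding: one binary indicator column per category.
onehot = pd.get_dummies(df["Side"], prefix="Side")

# Label encoding: each category mapped to an integer code.
le = LabelEncoder()
codes = le.fit_transform(df["Sunrise_Sunset"])

print(onehot.columns.tolist(), list(codes))
```

One-hot encoding suits nominal features with no natural order, while label encoding keeps a single column, which tree-based models can split on directly.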

5. Modelling

5.1. Workflow Demonstration

5.1.2. Severity Prediction

5.1.2.3. Decision Tree

5.1.2.4. Gradient Boost Tree

5.1.2.5. Random Forest

5.1.2.6. XGBoost

5.1.2.8. Model Comparison

5.1.2.9. Deal with Imbalanced data

It seems the over-sampling and under-sampling techniques do not improve model performance. More methods need to be explored.
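As a minimal sketch of the over-sampling idea (the notebook's exact resampling method is not specified; this shows plain random over-sampling with `sklearn.utils.resample` on toy data):

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced data: severity 2 dominates severity 4.
df = pd.DataFrame({"Severity": [2] * 8 + [4] * 2, "x": range(10)})

majority = df[df["Severity"] == 2]
minority = df[df["Severity"] == 4]

# Random over-sampling: duplicate minority rows until classes balance.
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_up])
print(balanced["Severity"].value_counts())
```

Under-sampling is the mirror image: resample the majority class down to `len(minority)` with `replace=False`.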

5.1.2.10. Severity Prediction Visualization

5.3. Severity Prediction

5.3.1 Parameter Tuning

5.3.2 Model Comparison

Classification Results

Decision Tree

Best Params: {'decisiontreeclassifier__max_depth': 5, 'decisiontreeclassifier__min_impurity_decrease': 0.2, 'decisiontreeclassifier__min_samples_leaf': 2} Best score: 0.816112

Gradient Boosting Algorithm

Best Params: {'gradientboostingclassifier__learning_rate': 0.3, 'gradientboostingclassifier__max_depth': 20, 'gradientboostingclassifier__min_impurity_decrease': 0.5, 'gradientboostingclassifier__min_samples_leaf': 2, 'gradientboostingclassifier__n_estimators': 50} Best score: 0.773303

Random Forest Algorithm

Best Params: {'randomforestclassifier__max_depth': 5, 'randomforestclassifier__min_impurity_decrease': 0.1, 'randomforestclassifier__min_samples_leaf': 2, 'randomforestclassifier__n_estimators': 50} Best score: 0.816112

XGB Classifier

Best Params: {'xgbclassifier__learning_rate': 0.1, 'xgbclassifier__max_depth': 5, 'xgbclassifier__n_estimators': 50, 'xgbclassifier__scale_pos_weight': 5} Best score: 0.775298
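The doubled-underscore keys reported above follow scikit-learn's pipeline naming convention (`<stepname>__<parameter>`). A minimal sketch of the tuning setup, using a small synthetic dataset and a reduced grid rather than the notebook's full search:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Small synthetic stand-in for the accident feature matrix.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# make_pipeline names each step after its lowercased class name,
# which is where keys like 'decisiontreeclassifier__max_depth' come from.
pipe = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=0))
param_grid = {
    "decisiontreeclassifier__max_depth": [3, 5],
    "decisiontreeclassifier__min_samples_leaf": [2, 4],
}
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```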

6. Conclusion

After applying machine learning models such as Random Forest, Decision Tree, XGB Classifier, and Gradient Boosting to this dataset, we conclude that XGBoost performs well, with around 90% accuracy.
Even after tuning their parameters, Decision Tree and Random Forest reached only about 81% accuracy in comparison. So, for this dataset, the XGBoost model is the best choice for car accident severity prediction.

7. Model Explainability

LIME